Skip to content

Cohort jobs: full per-run lineage with source dedup#93

Merged
jirhiker merged 2 commits into
mainfrom
docs/analyte-fetch-limitation
Jun 28, 2026
Merged

Cohort jobs: full per-run lineage with source dedup#93
jirhiker merged 2 commits into
mainfrom
docs/analyte-fetch-limitation

Conversation

@jirhiker

@jirhiker jirhiker commented Jun 28, 2026

Copy link
Copy Markdown
Member

Reworks the job layout so each run shows the full sources → combine → geoserver lineage while still unifying each shared source only once.

Problem

The shared-source dedup (#92) split sources into their own sources_job and made product jobs publish-only (combine + geoserver). That deduped fetches but hid lineage: a product job's graph showed only <product> → geoserver, sources nowhere in the run.

Why full lineage + dedup needs grouping

Cross-product dedup is impossible across separate runs — if two products that share arsenic/summary/wqp run as two runs, each must materialize it → duplicated. So the sharing products must run together. That grouping is a cohort.

Change

  • Cohort = (group, mode, scope) — the products that can share source assets. One job per cohort materializes its whole graph in one run: each shared source unifies once (single asset key, selected once), every member combine reads it back through the GCS IO manager, geoserver publishes.
  • 5 cohorts; the 87 source assets partition across them (no source materialized by two jobs → dedup preserved). Cohort cron = earliest member schedule.
  • Replaces sources_job + per-product publish jobs.
cohort job sources combines sched
waterlevels_summary_state_NM 9 1 (summary) 06:00
waterlevels_timeseries_state_NM 9 3 (timeseries, trends, recency) 07:00
waterlevels_timeseries_county_Bernalillo 1 1 (bernco) 08:00
analytes_summary_state_NM 59 4 (arsenic, tds, major-chem, mcl) 09:00
analytes_timeseries_state_NM 9 2 (arsenic-trend, nitrate-trend) 13:00

API load unchanged: sources still fetched once each; publishes inline into the same run.

Also

  • Documents the remaining per-analyte fetch limitation (a source repeats per analyte because the backend unifies one parameter per pass — collapsing it needs a backend multi-analyte change, out of scope here).

Verification

dg check defs clean; 277 offline tests pass; 5 cohort jobs each show full lineage; source assets partition with no cross-job duplication.

🤖 Generated with Claude Code

The shared-source dedup collapses duplication across products but not
across analytes: a source appears once per analyte because the backend
unifies one parameter per pass. Note the future multi-analyte-unification
optimization (a backend change) in the module docstring; out of scope for
the orchestration graph.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 28, 2026

Copy link
Copy Markdown

Your pull request is automatically being deployed to Dagster Cloud.

Location Status Link Updated
die-orchestration View in Cloud Jun 28, 2026 at 11:06 PM (UTC)

The sources_job + publish-only product jobs preserved dedup but hid each
product's source lineage (the product job showed only combine → geoserver).
Replace that with one job per cohort, keyed (group, mode, scope): the
products that can share source assets. A cohort job materializes its whole
graph in one run — each shared source unifies once (one asset key, selected
once), every member combine reads it back through the GCS IO manager, then
geoserver publishes. Full sources → combine → geoserver lineage per run, no
duplicated source fetch.

5 cohorts (waterlevels/analytes × summary/timeseries × scope); the 87
source assets partition across them so no source is materialized by two
jobs. Cohort cron = earliest member schedule. The tolerant IO manager now
only matters for ad-hoc combine-only materializations from the UI.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@jirhiker jirhiker changed the title Document per-analyte source-fetch limitation Cohort jobs: full per-run lineage with source dedup Jun 28, 2026
@jirhiker jirhiker merged commit 8c5a75c into main Jun 28, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant